Compact In-Memory Models for Compression of Large Text Databases

Authors

  • Justin Zobel
  • Hugh E. Williams
Abstract

For compression of text databases, semi-static word-based models are a pragmatic choice. They provide good compression with a model of moderate size, and allow independent decompression of stored documents. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, while a pure character-based model is poor. In addition, there are other kinds of semi-static model that can be used for text, such as word pairs. We propose a further kind of model that reduces the main-memory costs of a word-based model: approximate models, in which rare words are represented by similarly-spelt common words and a sequence of edits. We investigate the compression available with different memory-efficient models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can significantly improve the compression available in limited memory and greatly reduce overall memory requirements.
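The idea of an approximate model can be illustrated with a small sketch. The abstract does not specify how edits are chosen or encoded, so everything below is an assumption made for illustration: the function names (`edits_from`, `apply_edits`, `encode`, `decode`) are hypothetical, the edit script comes from Python's standard `difflib.SequenceMatcher` rather than from the authors' method, and the nearest common word is found by a brute-force scan of a tiny in-memory vocabulary.

```python
from difflib import SequenceMatcher


def edits_from(base, target):
    """Return (start, end, replacement) edits that turn base into target.

    Assumption: difflib's opcodes stand in for whatever edit
    representation the paper actually uses.
    """
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    return [
        (i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_edits(base, edits):
    """Rebuild the target word by applying edits to base, left to right."""
    pieces = []
    pos = 0
    for start, end, replacement in edits:
        pieces.append(base[pos:start])  # unchanged run before the edit
        pieces.append(replacement)      # inserted or substituted text
        pos = end                       # skip the deleted or replaced run
    pieces.append(base[pos:])
    return "".join(pieces)


def edit_cost(edits):
    """Rough cost: number of characters touched by the edits."""
    return sum(max(end - start, len(repl)) for start, end, repl in edits)


def encode(word, common_words):
    """Represent word as (common word, edits); common words need no edits."""
    if word in common_words:
        return (word, [])
    # Hypothetical selection rule: pick the common word with the cheapest edits.
    base = min(common_words, key=lambda w: (edit_cost(edits_from(w, word)), w))
    return (base, edits_from(base, word))


def decode(code):
    """Invert encode()."""
    base, edits = code
    return apply_edits(base, edits)


if __name__ == "__main__":
    vocabulary = ["compression", "memory", "model"]
    code = encode("compresion", vocabulary)
    print(code, "->", decode(code))
```

In this sketch only the common words need to be held in memory; a rare word costs one vocabulary reference plus a short edit list. How those references and edits would be entropy-coded is outside what the abstract states.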


Similar resources

Comparison among Suffix Array Construction Algorithms

Suffix array is a compact data structure for searching matched strings from text databases. It is an array of pointers and stores all suffixes of a text in lexicographic order. Because its memory requirement is less than tree structures, it is effective for large databases. Moreover, constructing the suffix array is used in the Block Sorting compression scheme. We compare algorithms for constructing suffix a...


A limited memory adaptive trust-region approach for large-scale unconstrained optimization

This study is concerned with a trust-region-based method for solving unconstrained optimization problems. The approach takes advantage of the compact limited memory BFGS updating formula together with an appropriate adaptive radius strategy. In our approach, the adaptive technique leads us to decrease the number of subproblems solved, while utilizing the structure of limited memory quasi-Newt...


Combining Text Compression and String Matching: The Miracle of Self-Indexing

This decade has witnessed the rise of what I consider the most important breakthrough of modern times in text compression and indexed string matching. Self-indexing is the mechanism by which a text is simultaneously compressed and indexed, so that the self-index occupies space close to that of the compressed text, provides random access to any part of it, and in addition supports efficient inde...


Text Compression for Dynamic Document Databases

For compression of text databases, semi-static word-based methods provide good performance in terms of both speed and disk space, but two problems arise. First, the memory requirements for the compression model during decoding can be unacceptably high. Second, the need to handle document insertions means that the collection must be periodically recompressed, if compression efficiency is to be m...


Implementation of VLSI Based Image Compression Approach on Reconfigurable Computing System - A Survey

Image data require huge amounts of disk space and large bandwidths for transmission. Hence, image compression is necessary to reduce the amount of data required to represent a digital image. Therefore an efficient technique for image compression is in high demand. Although lots of compression techniques are available, the technique which is faster, memory efficient and simple, surely...



Publication date: 1999